Assessment of health state utilities in dermatology: an experimental time trade-off value set for the dermatology life quality index

Background Dermatology Life Quality Index (DLQI) scores are used in many countries as access and reimbursement criteria for costly dermatological treatments. In this study we examined how time trade-off (TTO) utility valuations made by individuals from the general population are related to combinations of DLQI severity levels characterizing dermatologically relevant health states, with the ultimate purpose of developing a value set for the DLQI.
Methods We used data from an online cross-sectional survey conducted in Hungary in 2020 (n = 842 after sample exclusions). Respondents were assigned to one of 18 random blocks and were asked to provide 10-year TTO valuations for the corresponding five hypothetical health states. To analyze the relationship between DLQI severity levels and utility valuations, we estimated linear, censored, ordinal, and beta regression models, complemented by two-part scalable models accommodating heterogeneity effects in respondents' valuation scale usage. Successive severity levels (0–3) of each DLQI item were represented by dummy variables. We used cross-validation methods to reduce the initial set of 30 dummy variables and improve model robustness.
Results Our final, censored linear regression model with 13 dummy variables had R² = 0.136, thus accounting for 36.9% of the incremental explanatory power of a maximal (full-information) benchmark model (R² = 0.148) over the uni-dimensional model (R² = 0.129). Each DLQI item was found to have a negative effect on the valuation of health states, yet this effect was largely heterogeneous across DLQI items, and the relative contribution of distinctive severity levels also varied substantially. Overall, we found that the social/interpersonal consequences of skin conditions (in the areas of social and leisure activities, work and school, close personal relationships, and sexuality) had roughly twice as large a disutility impact as the physical/practical aspects.
Conclusions We have developed an experimental value set for the DLQI, which could prospectively be used for quantifying the quality-adjusted life years impact of dermatological treatments and serve as a basis for cost-effectiveness analyses. We suggest that, after validation of our main results through confirmatory studies, population-specific DLQI value sets could be developed and used for conducting cost-effectiveness analyses and developing financing guidelines in dermatological care. Supplementary Information The online version contains supplementary material available at 10.1186/s12955-022-01995-x.

We performed a 2×2 classification analysis on these two dimensions, aiming for a high degree of agreement between the classification results of the two criteria while keeping the number of exclusions reasonably low. We used the F-score to measure the degree of agreement between the classification results on the two dimensions. The F-score was calculated as the harmonic mean of the 'precision' and 'recall' rates defined in the confusion matrix below.
                                   Classification by criterion (2)
Classification by criterion (1)    exclude    don't exclude
exclude                            n11        n12
don't exclude                      n21        n22

precision: P = n11 / (n11 + n12)
recall: R = n22 / (n21 + n22)
F-score: F = 2PR / (P + R)

We plotted the F-scores against the number of joint exclusions, which was defined as the intersection of the two sets 'to be excluded'. We identified an 'efficient frontier' concerning the available (#exclusions; F-score) pairs, indicating the minimal number of exclusions necessary to reach a certain F-score (Fig. S1). In this way, the 9×17 initially considered combinations were narrowed down to 22 efficient combinations, from which the optimal one was to be selected subsequently.
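The F-score and the efficient frontier described above can be illustrated with a minimal Python sketch. The function names and the example counts are hypothetical and not taken from the study. The frontier rule used here (keep a pair only if no pair with at most as many exclusions reaches an equal or higher F-score) is one plausible reading of the text, not necessarily the authors' exact procedure.

```python
def f_score(n11, n12, n21, n22):
    """F-score from the 2x2 table, using P and R exactly as defined in the text."""
    p = n11 / (n11 + n12) if (n11 + n12) else 0.0  # precision: P = n11 / (n11 + n12)
    r = n22 / (n21 + n22) if (n21 + n22) else 0.0  # recall:    R = n22 / (n21 + n22)
    return 2 * p * r / (p + r) if (p + r) else 0.0  # harmonic mean of P and R


def efficient_frontier(pairs):
    """Return the non-dominated (n_exclusions, f_score) pairs.

    Assumption: a pair is efficient if no other pair reaches an equal or
    higher F-score with at most as many exclusions.
    """
    frontier = []
    best_f = float("-inf")
    # Sort by number of exclusions (ascending), then by F-score (descending).
    for n_excl, f in sorted(pairs, key=lambda t: (t[0], -t[1])):
        if f > best_f:
            frontier.append((n_excl, f))
            best_f = f
    return frontier


# Hypothetical counts and candidate pairs, for illustration only.
example_f = f_score(30, 10, 20, 40)
example_frontier = efficient_frontier(
    [(100, 0.20), (120, 0.18), (150, 0.30), (150, 0.25), (200, 0.30)]
)
```

In this illustration the pair (120, 0.18) is dropped because fewer exclusions already achieve a higher F-score, and (200, 0.30) is dropped because 150 exclusions reach the same F-score.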

Fig. S1
Decision boundary for sample restrictions based on minimal and median response times

Exclusion by response inconsistency
As regards the maximal tolerable degree of inconsistency in participants' valuations, we required that all TTO utility differences with respect to the 'worst possible' health state (H73) be greater than or equal to a certain threshold [thr_diff], which we varied in the range {-0.40, -0.35, ..., 0.00}. The optimal value of [thr_diff] was to be selected in the next step, conjointly with choosing the optimal value combination for the minimal and median response time thresholds.
To set the final values for the exclusion thresholds we performed another 2×2 classification analysis, whereby dimension (1) was the minimally required TTO utility difference with respect to H73, and dimension (2) comprised vectors of minimally required response times, which were allowed to vary within the previously identified range of efficient combinations. Again, we plotted the F-scores against the number of exclusions; however, joint exclusions were now defined by the union (rather than the intersection) of the two criteria, i.e. respondents were screened out if they either responded too quickly or gave overly inconsistent valuations. The efficient frontier concerning the set of (#exclusions; F-score) pairs was used as support for the final decision. Within the range of values eligible for the exclusion thresholds, the number of exclusions varied between 376 and 949, and the F-score varied between 0.129 and 0.554 (Fig. S2). We picked a reasonable-looking combination [thr_diff=(-0.10); thr_min=5; thr_med=10] around the middle of this region, resulting in 656 exclusions and an F-score of 0.320.
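As an illustration of how the final screening rule might be applied to a single respondent, the following Python sketch combines the response-time and consistency criteria. It rests on several assumptions that the text does not state explicitly: response times are measured in seconds; the utility difference is computed as the valuation of another state minus the valuation of H73; and the response-time criterion flags a respondent only when both the minimal and the median time fall below their thresholds (the intersection used in the first step). The function name and argument layout are hypothetical.

```python
from statistics import median


def is_excluded(response_times, other_utilities, u_h73,
                thr_diff=-0.10, thr_min=5.0, thr_med=10.0):
    """Return True if a respondent would be screened out.

    response_times  : times for the five valuation tasks (assumed seconds)
    other_utilities : TTO utilities of the four states other than H73
    u_h73           : TTO utility assigned to the 'worst possible' state H73
    """
    # Assumption: 'too fast' requires BOTH time thresholds to be violated.
    too_fast = min(response_times) < thr_min and median(response_times) < thr_med
    # Assumption: difference = utility of other state minus utility of H73.
    # Rounding guards against floating-point noise at the threshold.
    inconsistent = any(round(u - u_h73, 6) < thr_diff for u in other_utilities)
    # Union of the two criteria, as described in the text.
    return too_fast or inconsistent
```

For example, under these assumptions a respondent with times `[3, 4, 6, 8, 30]` would be excluded for responding too quickly, whereas one who rated another state 0.30 below H73 would be excluded for inconsistency regardless of response times.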

S.2 Supplement to the calculation of predicted utilities for traders
Combinations of DLQI severity levels ($x$) were mapped to predicted utility values ($\hat{y}$) by calculating weighted sums according to the corresponding vector of regression coefficients ($\beta$) and the regression intercept ($\alpha$), and (if applicable) applying the appropriate inverse link function to the linear predictor value thus obtained. This was carried out in different ways depending on the type of regression model.
1) In the case of ordinary linear regression models the usual scalar product formula was appropriate for calculating predicted utility values: $\hat{y} = \alpha + x'\beta$.
2) In the case of censored regression it was necessary to apply left- and right-censoring at the corresponding lower ($y_L = 0$) and upper ($y_U = 1$) thresholds: $\hat{y} = \max(\min(\alpha + x'\beta, 1), 0)$.
3) In the case of ordinal regression we applied continuity correction in proportion to the relative position of the estimated latent variable value ($y^* = \alpha + x'\beta$) between the lower ($\gamma_L$) and upper ($\gamma_U$) thresholds separating the predicted discrete TTO utility category ($\hat{y}_d$) from the categories below ($\hat{y}_d - 0.05$) and above ($\hat{y}_d + 0.05$).
4) In the case of beta regression the inverse link function $g^{-1}(\cdot)$ had to be applied to the linear predictor $\alpha + x'\beta$. By our choice of the probit link, $g^{-1}(\cdot)$ was equal to the standard normal cumulative distribution function $\Phi(\cdot)$, so that predicted utilities were obtained in the form $\hat{y} = \Phi(\alpha + x'\beta)$.
5) In the case of the two-part linear regression model first we applied the usual linear formula ($\hat{z} = \alpha + x'\beta$) to estimate the relative disutility ($z$) from the health state ($x$), which we further multiplied by the sample mean effective scale range ($m(\lambda)$). Finally, the predicted utility was calculated by subtracting the scaled disutility from the utility of perfect health: $\hat{y} = 1 - m(\lambda)\hat{z}$.
6) In the case of the two-part censored regression model we combined steps of the calculation as with models (2) and (5). First, the relative disutility ($z$) was estimated using the left-censored regression $\hat{z} = \max(\alpha + x'\beta, 0)$. Then, the predicted utility was obtained in the form $\hat{y} = 1 - m(\lambda)\hat{z}$.
7) In the case of the two-part beta regression model we combined steps of the calculation as with models (4) and (5). First, the estimated relative disutility ($z$) was calculated by applying the inverse probit link function to the linear predictor: $\hat{z} = \Phi(\alpha + x'\beta)$. Then, the predicted utility was obtained in the form $\hat{y} = 1 - m(\lambda)\hat{z}$.
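The prediction formulas for models (1), (2), and (4) to (7) can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function names, the `kind` argument, and all numerical values are hypothetical, and `mean_range` stands for the sample mean effective scale range m(λ). The ordinal model (3) is omitted because its continuity-correction formula is not given in the text.

```python
import math


def linear_predictor(alpha, beta, x):
    """Linear predictor: alpha + x'beta."""
    return alpha + sum(b * xi for b, xi in zip(beta, x))


def std_normal_cdf(v):
    """Standard normal CDF Phi(v), i.e. the inverse probit link."""
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))


def predict_linear(alpha, beta, x):
    """Model (1): y_hat = alpha + x'beta."""
    return linear_predictor(alpha, beta, x)


def predict_censored(alpha, beta, x):
    """Model (2): linear prediction censored to the interval [0, 1]."""
    return max(min(linear_predictor(alpha, beta, x), 1.0), 0.0)


def predict_beta(alpha, beta, x):
    """Model (4): y_hat = Phi(alpha + x'beta)."""
    return std_normal_cdf(linear_predictor(alpha, beta, x))


def predict_two_part(alpha, beta, x, mean_range, kind="linear"):
    """Models (5)-(7): y_hat = 1 - m(lambda) * z_hat."""
    eta = linear_predictor(alpha, beta, x)
    if kind == "linear":        # model (5): z_hat = eta
        z_hat = eta
    elif kind == "censored":    # model (6): z_hat = max(eta, 0)
        z_hat = max(eta, 0.0)
    elif kind == "beta":        # model (7): z_hat = Phi(eta)
        z_hat = std_normal_cdf(eta)
    else:
        raise ValueError(f"unknown model kind: {kind}")
    return 1.0 - mean_range * z_hat
```

With hypothetical inputs `alpha=0.0`, `beta=[0.2, 0.3]`, `x=[1, 1]`, and `mean_range=0.8`, the two-part linear model gives a relative disutility of 0.5 and hence a predicted utility of 1 − 0.8 × 0.5 = 0.6.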

S.3 Supplement to the effects of sample exclusions
The screening procedure was successful in enhancing the quality of the sample in terms of response times and consistency of valuations. Concerning the time taken to complete the valuation tasks, the sample mean was 14.8 seconds for the shortest response time across the five health states presented, whereas for the middle of the five response times the sample mean was 28.2 seconds. The overall sample mean concerning the average of the five response times was 36.5 seconds. For comparison: the same values for the initial sample were 9.4, 17.9, and 24.9 seconds, respectively.
Consistency of the valuations with respect to the 'worst possible' health state (H73) was also significantly improved. The mean difference between the TTO utility assigned to state H73 and the lowest of the four other valuations was -0.100, and the mean difference with respect to the average of the four other valuations was -0.215. For comparison: the same values for the initial sample were 0.045 (positive) and -0.047, respectively.

S.4 Supplement to cross-validation outcomes
Cross-validation (CV) was essential for eliminating model variables which did not have a consistently negative effect in every subsample (suppl. Tables S17, S18). CV fit indices improved monotonically along the model selection procedure (suppl. Tables S12-S14